Nature Computational Science
Springer Science and Business Media LLC
All preprints, ranked by how well they match Nature Computational Science's content profile, based on 50 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Li, J.; Li, T.; Chattopadhyay, I.
As we begin to recover from the COVID-19 pandemic, a key question is whether we can avert such disasters in the future. Current surveillance protocols generally focus on qualitative impact assessments of viral diversity 1. These efforts are primarily aimed at ecosystem and human impact monitoring, and do not help to precisely quantify emergence. Currently, the similarity of biological strains is measured by the edit distance, or the number of mutations that separate their genomic sequences 2-6, e.g. the number of mutations that make an avian flu strain human-adapted. However, ignoring the odds of those mutations in the wild keeps us blind to the true jump risk, and gives us little indication of which strains are more risky. In this study, we develop a more meaningful metric for comparison of genomic sequences. Our metric, the q-distance, precisely quantifies the probability of a spontaneous jump by random chance. Learning from patterns of mutations in large sequence databases, the q-distance adapts to the specific organism, the background population, and realistic selection pressures, demonstrably improving inference of ancestral relationships and future trajectories. As an important application, we show that the q-distance predicts future strains for seasonal influenza, outperforming the World Health Organization (WHO)-recommended flu-shot composition almost consistently over two decades. Such performance is demonstrated separately for the Northern and Southern hemispheres, for different subtypes, and for key capsid proteins. Additionally, we investigate the SARS-CoV-2 origin problem and quantify the likelihood that different animal species hosted an immediate progenitor, producing a list of related bat species with a quantifiably high likelihood of being the source. We also identify specific rodents with a credible likelihood of hosting a SARS-CoV-2 ancestor.
Combining machine learning and large deviation theory, the analysis reported here may open the door to actionable predictions of future pandemics.
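The core idea, weighting each mutation by how likely it is in the wild rather than counting mutations equally, can be sketched in a few lines. This is not the authors' q-distance, which is learned from large sequence databases under realistic selection pressures; the substitution-probability table and function below are purely hypothetical, for illustration only.

```python
import math

def weighted_jump_distance(seq_a, seq_b, sub_prob):
    """Toy probability-weighted distance between two aligned sequences.

    Instead of counting mutations equally (edit distance), each
    substitution contributes -log(p), so improbable changes dominate.
    sub_prob[(x, y)] is an (assumed) estimate of how often x -> y is
    observed in the background population.
    """
    assert len(seq_a) == len(seq_b)
    dist = 0.0
    for x, y in zip(seq_a, seq_b):
        if x != y:
            dist += -math.log(sub_prob.get((x, y), 1e-6))
    return dist

# Hypothetical substitution probabilities, chosen for illustration.
probs = {("A", "G"): 0.30, ("C", "T"): 0.25, ("A", "C"): 0.01}

common = weighted_jump_distance("ACGT", "GCGT", probs)  # A->G, frequent
rare = weighted_jump_distance("ACGT", "CCGT", probs)    # A->C, rare
# Both pairs are one mutation apart in edit distance, but the rare
# substitution yields a much larger weighted distance.
```

Under this weighting, two strains separated by a few common mutations can be "closer", in jump-risk terms, than strains separated by a single improbable one.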
Fabbrizzi, M.; Amato, L. G.; Martinelli, L.; Carpaneto, J.; Bartolini, E.; Calderoni, S.; Retico, A.; Vergani, A. A.; Mazzoni, A.
Brain structure plays a pivotal role in shaping neural dynamics. Current models lack the anatomical and functional resolution needed to accurately capture both structural and dynamical features of the human brain. Here, we introduce the FEDE (high FidElity Digital brain modEl) pipeline, which generates anatomically accurate brain digital twins from imaging data. Using advanced techniques for anatomical tissue segmentation and finite-element analysis, FEDE reconstructs brain structure at high spatial resolution while also replicating whole-brain neural activity. We demonstrate its application by creating the first brain digital twin of a toddler with autism spectrum disorder (ASD). Through parameter optimization, FEDE replicated both time-frequency and spatial features of recorded neural activity. Notably, FEDE predicted patient-specific aberrant values of the excitation-to-inhibition ratio, consistent with ASD pathophysiology. FEDE represents a significant leap forward in brain modeling, paving the way for more effective applications of digital twins in experimental and clinical settings.
Wang, H. E.; Woodman, M.; Triebkorn, P.; Lemarechal, J.-D.; Jha, J.; Dollomaja, B.; Vattikonda, A. N.; Sip, V.; Medina Villalon, S.; Hashemi, M.; Guye, M.; Scholly, J.; Bartolomei, F.; Jirsa, V.
One-third of the 50 million epilepsy patients worldwide suffer from drug-resistant epilepsy and are candidates for surgery. Precise estimates of the epileptogenic zone networks (EZNs) are crucial for planning intervention strategies. Here, we present the Virtual Epileptic Patient (VEP), a multimodal probabilistic modeling framework for personalized end-to-end analysis of brain imaging data of drug-resistant epilepsy patients. The VEP uses data-driven, personalized virtual brain models derived from patient-specific anatomical (such as T1-MRI, DW-MRI, and CT scan) and functional data (such as stereo-EEG). It employs Markov Chain Monte Carlo (MCMC) and optimization methods from Bayesian inference to estimate a patient's EZN while considering robustness, convergence, sensor sensitivity, and identifiability diagnostics. We describe both high-resolution neural field simulations and a low-resolution neural mass model inversion. The VEP workflow was evaluated retrospectively with 53 epilepsy patients and is now being used in an ongoing clinical trial (EPINOV).
Ojemann, W. K. S.; Xu, Z.; Shi, H.; Walsh, K.; Pattnaik, A. R.; Sinha, N.; Lavelle, S.; Aguila, C.; Gallagher, R.; Revell, A. Y.; LaRocque, J. J.; Korzun, J.; Kulick-Soper, C. V.; Zhou, D. J.; Galer, P. D.; Sinha, S. R.; Shinohara, R.; Davis, K. A.; Litt, B.; Conrad, E. C.
Annotating seizure onset and spread in intracranial EEG is essential for epilepsy surgical planning, yet manual annotation is unreliable and cannot scale to large datasets. We introduce Neural Dynamic Divergence (NDD), an unsupervised framework that detects seizure activity by measuring deviation from patient-specific baseline neural dynamics using autoregressive models. NDD requires no labeled training data and adapts to individual patients, channels, and brain states. Validating against expert consensus annotations from 46 seizures, NDD achieves human-level agreement (φ = 0.58 vs. inter-rater φ = 0.64) and outperforms existing algorithms on 1,019 seizures with soft labels (AUROC = 0.87). We demonstrate clinical utility by automatically annotating 2,017 seizures, revealing that seizure spread patterns distinguish epilepsy subtypes and predict surgical outcomes. NDD also generalizes to continuous ICU scalp EEG monitoring (AUROC = 0.77). We provide NDD as an open-source Python package to enable scalable seizure annotation across research centers.
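The deviation-from-baseline idea behind NDD can be sketched with a plain autoregressive model: fit AR coefficients on a baseline signal, then score new segments by how poorly the model predicts them. The AR order, synthetic signals, and scoring below are simplified assumptions for illustration, not the released package's implementation.

```python
import numpy as np

def fit_ar(x, p=5):
    """Least-squares AR(p) fit on a baseline signal (one channel)."""
    X = np.column_stack([x[i:len(x) - p + i] for i in range(p)])
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, (y - X @ coef).std()

def divergence(x, coef, sigma, p=5):
    """Mean normalized one-step prediction error: how far the signal
    has drifted from its own baseline dynamics."""
    X = np.column_stack([x[i:len(x) - p + i] for i in range(p)])
    err = x[p:] - X @ coef
    return float(np.mean(np.abs(err)) / sigma)

rng = np.random.default_rng(0)
# Baseline: a stable oscillation; "seizure": an abrupt regime change.
base = np.sin(np.arange(2000) * 0.1) + 0.1 * rng.standard_normal(2000)
coef, sigma = fit_ar(base)
quiet = divergence(base[:500], coef, sigma)
burst = divergence(5.0 * rng.standard_normal(500), coef, sigma)
# burst >> quiet: the model flags dynamics it cannot predict.
```

Thresholding such a score per channel, relative to each patient's own baseline, is the unsupervised detection pattern the abstract describes.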
Mansour L, S.; Di Biase, M. A.; Yan, H.; Xue, A.; Venketasubramanian, N.; Chong, E.; Alexander-Bloch, A.; Chen, C.; Zhou, J. H.; Yeo, B. T. T.; Zalesky, A.
Normative modeling in neuroscience aims to characterize interindividual variation in brain phenotypes and thus establish reference ranges, or brain charts, against which individual brains can be compared. Normative models are typically restricted to coarse spatial scales by computational constraints, limiting their spatial specificity. They additionally depend on fixed regions from fixed parcellation atlases, restricting their adaptability to alternative parcellation schemes. To overcome these key limitations, we propose spectral normative modeling (SNM), which leverages brain eigenmodes for efficient spatial reconstruction to generate normative ranges for arbitrary new regions of interest. Benchmarking against conventional counterparts, SNM achieves a 98.3% speedup in computing accurate normative ranges across spatial scales, from millimeters to the whole brain. We demonstrate its utility by elucidating high-resolution individual cortical atrophy patterns and characterizing the heterogeneous nature of neurodegeneration in Alzheimer's disease. SNM lays the groundwork for a new generation of spatially precise brain charts, offering substantial potential to drive advances in individualized precision medicine.
Abivardi, A.; Webster, M.; McCarthy, P.; Alfaro-Magro, F.; Radosavljevic, L.; Miller, K. L.; Jbabdi, S.; Woolrich, M. W.; Gong, W.; Beckmann, C. F.; Elliott, L. T.; Nichols, T. E.; Smith, S. M.
Population-scale neuroimaging allows for novel biological discovery, but voxelwise analyses are computationally prohibitive and noisy, whereas imaging-derived phenotypes discard crucial spatial detail. We introduce PANDORA (Population Archive of Neuroimaging Data Organized for Rapid Analysis), a data-adaptive modelling platform designed to resolve this trade-off. PANDORA has encoded brain MRI data comprising 98 sub-modalities from over 80,000 UK Biobank participants in a highly efficient supervoxel representation. By performing statistical regressions directly within this compressed embedding, PANDORA reduces storage by up to 99% and accelerates computation 10-fold, while acting as a spatial denoiser to enhance statistical power. PANDORA also includes the full-resolution voxelwise ground-truth data, curated imaging confound variables, and a fast analysis tool achieving whole-brain, voxelwise population-level regression in seconds to minutes. We showcase PANDORA's ability to reproduce known patterns and to reveal new associations, including with trauma, anxiety/depression, autism polygenic scores, and EPHA3.
Yang, C.; Feng, J.; Beckmann, C. F.; Smith, S. M.; Gong, W.
Neuroimaging faces a reproducibility crisis, where studies on small, heterogeneous datasets produce unreliable brain-wide associations and AI models that fail to generalize. To address this, we introduce GenBrain, a generative foundation model pretrained on approximately 1.2 million 3D scans from over 44,000 individuals across 34 imaging modalities to learn a population prior of brain structure and function. Crucially, GenBrain enables rapid, data-efficient adaptation, allowing any targeted study to generate biologically valid synthetic cohorts, conditioned on demographics, disease status, or other modalities, to augment statistical power and enhance generalizability. We demonstrate GenBrain's transformative utility across 81 independent datasets spanning diverse populations, protocols, and clinical conditions. For image-level tasks, it achieves state-of-the-art performance in image enhancement and cross-modality synthesis while preserving subject-specific neurobiology. In population neuroscience, synthetic cohorts from GenBrain stabilize effect-size estimates and significantly improve the reproducibility of brain-wide association studies. For clinical AI, disease-specific fine-tuning of GenBrain substantially boosts the cross-site generalizability of prediction models. Finally, we demonstrate its direct translational value when adapted to an unseen modality and scarce clinical stroke data: GenBrain significantly improves predictions of acute stroke severity and chronic aphasia, showing actionable utility under extreme data scarcity. By empowering small-scale studies with large-scale population priors, GenBrain provides a unified framework for more reproducible and clinically generalizable neuroimaging analysis.
De Silva, N.; Perez, A.
The size of molecular dynamics (MD) trajectories remains a major obstacle for data sharing, long-term storage, and ensemble analysis at scale. Existing solutions often rely on frame subsampling or reduced atom representations, which limit the utility of shared datasets. Here, we present MDZip, a neural compression framework based on convolutional autoencoders trained per system to reconstruct atomic trajectories with high geometric fidelity from compact latent representations. MDZip achieves over 95% reduction in storage size across a diverse benchmark of proteins, protein-peptide complexes, and nucleic acids. Despite operating in a physics-agnostic manner, the reconstructed trajectories accurately preserve ensemble-level features, including RMSD fluctuations, pairwise distance distributions, radius of gyration, and projections onto principal and time-lagged independent components. A residual (skip-connected) autoencoder variant consistently improves reconstruction accuracy and reduces outliers. While local structural deviations can impair energetic fidelity, short energy minimization partially recovers physically reasonable conformations. This framework enables customizable compression-accuracy trade-offs and supports a modular workflow for sharing latent representations, decoder models, and reconstruction protocols. MDZip offers a scalable solution to current storage limitations, facilitating broader dissemination of MD data without sacrificing essential dynamical information.
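As a minimal stand-in for MDZip's per-system convolutional autoencoder, a truncated SVD shows the same encode/decode round-trip: store a small latent matrix per frame and reconstruct coordinates on demand. All sizes and the synthetic trajectory below are assumptions for illustration; the actual method trains a nonlinear, skip-connected autoencoder per system.

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_coords, k = 500, 300, 20  # k latent dims (assumed)

# Synthetic low-dimensional "trajectory": a few slow collective motions
# plus thermal noise, mimicking correlated atomic fluctuations.
t = np.linspace(0, 8 * np.pi, n_frames)[:, None]
modes = rng.standard_normal((4, n_coords))
traj = np.sin(t * np.arange(1, 5)) @ modes \
    + 0.05 * rng.standard_normal((n_frames, n_coords))

# "Encode": project onto top-k singular vectors; "decode": project back.
mean = traj.mean(axis=0)
U, S, Vt = np.linalg.svd(traj - mean, full_matrices=False)
latent = U[:, :k] * S[:k]          # n_frames x k: the stored payload
recon = latent @ Vt[:k] + mean

ratio = latent.size / traj.size    # ~0.07, i.e. ~93% storage reduction
rmsd = np.sqrt(np.mean((traj - recon) ** 2))
```

The compression-accuracy trade-off the abstract mentions corresponds to choosing k; a learned nonlinear encoder can push the same latent budget much further than this linear sketch.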
Banda, J. M.; Adderley, N.; Ahmed, W.-U.-R.; AlGhoul, H.; Alser, O.; Alser, M.; Areia, C.; Cogenur, M.; Fister, K.; Gombar, S.; Huser, V.; Jonnagaddala, J.; Lai, L.; Leis, A.; Mateu, L.; Mayer, M. A.; Minty, E.; Morales, D. R.; Natarajan, K.; Paredes, R.; Periyakoil, V. S.; Prats-Uribe, A.; Ross, E. G.; Singh, G. V.; Subbian, V.; Vivekanantham, A.; Prieto-Alhambra, D.
As the SARS-CoV-2 virus (COVID-19) continues to affect people across the globe, there is limited understanding of the long-term implications for infected patients 1-3. While some of these patients have documented follow-ups in clinical records, or participate in longitudinal surveys, these datasets are usually designed by clinicians and are not granular enough to capture the natural history or patient experience of long COVID. To get a complete picture, there is a need to meticulously characterize these patients' experiences, from infection to months post-infection, in real time and with highly granular patient-generated data rather than clinician narratives. In this work, we present a longitudinal characterization of post-COVID-19 symptoms using social media data from Twitter. Using a combination of machine learning, natural language processing techniques, and clinician reviews, we mined 296,154 tweets to characterize the post-acute course of the disease, creating detailed timelines of symptoms and conditions and analyzing symptomatology over a period of more than 150 days.
Mills, C.; Kraemer, M. U. G.; Donnelly, C. A.
Understanding the past, current, and future dynamics of dengue epidemics is challenging yet increasingly important for global public health. Using data from northern Peru spanning 2010 to 2021, we introduce a multi-model approach that integrates new and existing techniques for understanding and predicting dengue epidemics. Using wavelet analyses, we unveil spatiotemporal patterns and estimate space-varying epidemic drivers across shorter and longer dengue cycles, while our Bayesian hierarchical model allows us to quantify the timing, structure, and intensity of such climatic influences. For forecasting, as a single model is generally sub-optimal, we introduce trained and untrained probabilistic ensembles. In settings that mirror real-world implementations, we develop climate-informed and covariate-free deep learning forecasting models involving foundation time-series models, temporal convolutional networks, and conformal inference. We complement modern techniques with statistically principled training, assessment, and benchmarking of ensembles, alongside interpretable metrics for outbreak detection, to disseminate outputs to communities and public health authorities. Our ensembles generally outperformed individual models across space and time. Looking forward, whether the public health objective is to learn from the past and/or to predict future dengue epidemic dynamics, our multi-model approach can be used to inform the decision-making of public health authorities.
Steyn, N.; Parag, K. V.
The instantaneous reproduction number (Rt) is a key measure of the rate of spread of an infectious disease. Correctly quantifying uncertainty in Rt estimates is crucial for making well-informed decisions. Popular Rt estimators leverage smoothing techniques to distinguish signal from noise. Examples include EpiEstim and EpiFilter, which are both controlled by a "smoothing parameter" that is traditionally selected by users. We demonstrate that appropriate values of these smoothing parameters are not known a priori and vary markedly with epidemic dynamics, and show that data-driven smoothing is crucial for accurate uncertainty quantification of Rt estimates. We derive model likelihoods for the smoothing parameters in both EpiEstim and EpiFilter and develop a Bayesian framework to automatically marginalise these parameters when fitting to epidemiological time-series data. This yields novel marginal posterior predictive distributions which prove integral to rigorous model evaluation. Applying our methods, we find that default parameterisations of these widely-used estimators can negatively impact Rt inference, delaying detection of epidemic growth and misrepresenting uncertainty (typically producing overconfident estimates), with implications for public health decision-making. Our extensions mitigate these issues, provide a principled approach to uncertainty quantification, improve the robustness of real-time Rt inference, and facilitate model comparison using observable quantities.
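The EpiEstim-style estimator the preprint builds on has a closed-form Gamma posterior for Rt over a trailing window, which makes the role of the smoothing (window) parameter concrete. The prior values, toy case counts, and generation interval below are illustrative assumptions, and this sketch does not implement the authors' marginalisation of the window.

```python
import numpy as np

def rt_posterior_mean(incidence, gen_interval, window=7, a=1.0, b=0.2):
    """EpiEstim-style Rt: with a Gamma(a, b) prior, the posterior over a
    trailing window is Gamma(a + sum cases, b + sum infection pressure).
    `window` is the smoothing parameter the preprint argues should be
    learned from data rather than fixed by the user."""
    w = np.asarray(gen_interval, dtype=float)
    w = w / w.sum()
    I = np.asarray(incidence, dtype=float)
    n = len(I)
    # Total infectiousness Lambda_t = sum_s I_{t-s} * w_s
    lam = np.array([np.dot(I[max(0, t - len(w)):t][::-1], w[:min(t, len(w))])
                    for t in range(n)])
    shape = a + np.array([I[max(0, t - window + 1):t + 1].sum() for t in range(n)])
    rate = b + np.array([lam[max(0, t - window + 1):t + 1].sum() for t in range(n)])
    return shape / rate  # posterior mean of Rt at each t

cases = [10, 12, 15, 20, 26, 34, 44, 57, 74, 96]  # toy growing epidemic
w = [0.2, 0.5, 0.3]                               # toy generation interval
rt = rt_posterior_mean(cases, w)
# Later Rt estimates sit above 1, consistent with exponential growth.
```

A longer window shrinks the posterior variance but delays detection of changes in Rt, which is exactly the trade-off the preprint makes data-driven.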
Olchanyi, M.; Schreier, D. R.; Li, J.; Maffei, C.; Sorby-Adams, A.; Kinney, H. C.; Healy, B. C.; Freeman, H. J.; Shless, J.; Destrieux, C.; Tregidgo, H.; Iglesias, J. E.; Brown, E. N.; Edlow, B. L.
Brainstem white matter bundles are essential conduits for neural signaling involved in the modulation of vital functions ranging from homeostasis to human consciousness. Their architecture forms the anatomic basis for brainstem connectomics, subcortical mesoscale circuit models, and deep brain navigation tools. However, their small size and complex morphology compared to cerebral white matter structures make mapping and segmentation challenging in neuroimaging, resulting in a near absence of automated brainstem white matter tracing methods. We leverage diffusion MRI tractography to create the BrainStem Bundle Tool (BSBT), which segments eight key white matter bundles in the rostral brainstem. BSBT performs automated segmentation on a custom probabilistic fiber map generated from tractography, using a convolutional neural network architecture tailored for detection of small structures. We demonstrate BSBT's robustness across diffusion MRI acquisition protocols through validation on in vivo scans of healthy subjects and ex vivo scans of brain specimens with corresponding histology. Using BSBT, we reveal distinct brainstem white matter bundle alterations in Alzheimer's disease, Parkinson's disease, and acute traumatic brain injury cohorts through tract-based analysis and classification tasks. Finally, we provide proof-of-principle evidence supporting the prognostic utility of BSBT in a longitudinal analysis of coma recovery. BSBT creates opportunities to automatically map brainstem white matter in large imaging cohorts and to investigate its role in a broad spectrum of neurological disorders.
Zhang, C.; Leach, A.; Makkink, T.; Arbesu, M.; Kadri, I.; Luo, D.; Mizrahi, L.; Krichen, S.; Lang, M.; Tovchigrechko, A.; Lopez Carranza, N.; Sahin, U.; Beguir, K.; Rooney, M.; Fu, Y.
The protein structure prediction field has been revolutionised by deep learning, with protein folding models such as AlphaFold 2 and ESMFold. These models enable rapid in silico prediction and have been integrated into de novo protein design and protein-protein interaction (PPI) prediction. However, biologically relevant features dependent on conformational distributions cannot be estimated with these models. Diffusion models, a novel class of generative models, have been developed to learn conformational distributions and applied to de novo protein design. Limited work has been done on protein structure inpainting, where a masked section is recovered by simultaneously conditioning on its sequence and the rest of the structure. In this work, we propose FrameDiff inPainTing (FrameDiPT), a generalised model for protein inpainting. Inpainting is particularly important for T-cell receptors, given the hyper-variability of the complementarity determining region (CDR) loops. We evaluated the model on CDR loop design for T-cell receptors and achieved prediction accuracy comparable to ProteinGenerator and RFdiffusion with limited training data and learnable parameters. Unlike deterministic structure prediction models, FrameDiPT captures the conformational distribution at different regions and binding states, highlighting a key advantage of generative models. The model and inference code have been released 1.
Thakur, L. S.; Bharj, G.; Sangabattula, L.; Malik, B.
Multimodal biomedical datasets, such as those from neurodegenerative disease cohorts, present significant challenges in stratifying heterogeneous patient populations due to missing values, high dimensionality, and modality-specific biases. Traditional clustering methods often require extensive preprocessing and fail to integrate heterogeneous data types effectively. We introduce ci-fGBD (Cluster-Integrated Fast Generalized Bruhat Decomposition), a novel matrix factorization and clustering framework that natively operates on block-structured, multimodal datasets. ci-fGBD extends the classical Bruhat decomposition by jointly learning latent representations and patient clusters while automatically harmonizing contributions across diverse modalities, including neuroimaging, cognitive assessments, genomics, wearable sensors, and environmental exposures. Benchmarking against standard methods on real datasets demonstrates that ci-fGBD consistently identifies clinically meaningful subgroups, capturing subtle biological, cognitive, and demographic heterogeneity in Alzheimer's disease cohorts with superior interpretability and robustness.
Sorensen, I. F.; Sorensen, P.
We present multivariate Bayesian regression models specifically designed to overcome data-sharing barriers in health and genomics. These multi-response models are well suited for scenarios where data must remain decentralized due to privacy, intellectual property, or regulatory constraints. In extensive simulation studies, our approach consistently outperformed traditional single-response models trained on individual datasets, particularly under real-world conditions such as low signal, unbalanced cohorts, and high-dimensional feature spaces. For the first time, we demonstrate that multivariate Bayesian regression can be implemented using orthogonal transformations of sufficient statistics, enabling fully privacy-preserving analysis without sharing individual-level data. The models are scalable, interpretable, and applicable to predictive tasks across diverse collaborators, supporting secure data-driven research in domains such as clinical trials, biomarker discovery, and precision health.
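The sufficient-statistics idea can be illustrated with ordinary ridge regression: each site shares only X'X and X'y, and the combined solution equals the pooled fit without any individual-level row leaving a site. This is a standard federated-regression sketch, not the authors' multivariate Bayesian model; all names and numbers below are illustrative.

```python
import numpy as np

def site_statistics(X, y):
    """Each site shares only its sufficient statistics, never rows of
    individual-level data; these suffice to fit a ridge linear model."""
    return X.T @ X, X.T @ y

def federated_ridge(stats, lam=1.0):
    """Coordinator sums per-site statistics and solves
    (sum X'X + lam*I) beta = sum X'y."""
    XtX = sum(s[0] for s in stats)
    Xty = sum(s[1] for s in stats)
    return np.linalg.solve(XtX + lam * np.eye(XtX.shape[0]), Xty)

rng = np.random.default_rng(2)
beta_true = np.array([1.0, -2.0, 0.5])
sites, pooled_X, pooled_y = [], [], []
for _ in range(3):  # three decentralized cohorts
    X = rng.standard_normal((100, 3))
    y = X @ beta_true + 0.1 * rng.standard_normal(100)
    sites.append(site_statistics(X, y))
    pooled_X.append(X)
    pooled_y.append(y)

beta_fed = federated_ridge(sites)
# Identical to the (privacy-violating) pooled fit:
Xp, yp = np.vstack(pooled_X), np.concatenate(pooled_y)
beta_pool = np.linalg.solve(Xp.T @ Xp + np.eye(3), Xp.T @ yp)
```

The preprint's contribution goes further, applying orthogonal transformations of such statistics within a multivariate Bayesian model, but the decentralization mechanics are the same.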
Jackson, N. J.; Espinosa-Dice, N.; Yan, C.; Malin, B. A.
Synthetic data generation is a promising approach for biomedical data sharing and dataset augmentation, yet existing methods lack mechanisms to preserve statistical properties necessary for scientific analysis. To address this, we introduce RLSyn+Reg, a reinforcement-learning-driven generative model that encourages regression models trained on synthetic data to reproduce the coefficients and predictions of their real-data counterparts. We evaluate RLSyn+Reg on MIMIC-III and the American Community Survey (ACS) across regression model reproduction, fidelity to real data, and privacy. Synthetic data from RLSyn+Reg substantially improves upon that of RLSyn alone, raising correlations between real and synthetic regression coefficients from 0.054 to 0.600 on MIMIC-III and from 0.160 to 0.376 on ACS. Predictive performance also improves, closing the gap to real-data baselines by 81.4% and 97.6% on MIMIC-III and ACS, respectively. These improvements come with negligible cost to fidelity or privacy and are robust to reductions in training data.
Neyestanak, M. S.; Burbach, S. M.; Ng, K.; Gangavarapu, P.; Hurtado, J.; Magura, J.; Ismail, N.; Muema, D.; Ndung'u, T.; Ward, A.; Briney, B.
Scaling laws for large language models in natural language domains are typically derived under the assumption that performance is primarily compute-constrained. In contrast, antibody language models (AbLMs) trained on paired sequences are primarily data-limited, thus requiring different considerations. To explore how model size and data scale affect AbLM performance, we trained 15 AbLMs across all pairwise combinations of five model sizes and three training data sizes. From these experiments, we derive an AbLM-specific scaling law and estimate that training a data-optimal AbLM equivalent of the highly performant 650M-parameter ESM-2 protein language model would require ~5.5 million paired antibody sequences. Evaluation on multiple downstream classification tasks revealed that significant performance gains emerged only with sufficiently large model size, suggesting that in data-limited domains, improved performance depends jointly on both model scale and data volume.
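A scaling law of this kind is typically fit as a power law, which is linear in log-log space. The numbers below are synthetic stand-ins, not the paper's AbLM measurements; the sketch only shows the fit-and-extrapolate mechanics behind an estimate like "~5.5 million sequences needed".

```python
import numpy as np

# Hypothetical (dataset size, loss) pairs following an exact power law.
data_sizes = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
losses = 2.0 * data_sizes ** -0.08  # synthetic power-law behaviour

# A power law L = a * D^b is linear in log-log space:
# log L = log a + b * log D, so least squares recovers (a, b).
b, log_a = np.polyfit(np.log(data_sizes), np.log(losses), 1)
a = np.exp(log_a)

# Extrapolate: data needed to reach a target loss L_target = a * D^b.
L_target = 0.8 * losses[-1]
D_needed = (L_target / a) ** (1 / b)
```

With real measurements, the fit would carry noise and the extrapolation confidence intervals that this noiseless sketch omits.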
Yao, M.; Praturu, A.; Sharpee, T.
The increasing size of datasets poses challenges for their visualization and interpretation, highlighting the need for scalable and effective analysis methods. Hyperbolic embeddings have shown strong potential in capturing complex hierarchical structures across diverse systems. However, existing hyperbolic embedding methods typically operate with fixed curvature and have difficulty scaling to large datasets. To address these limitations, we propose MuH-MDS, a novel multiscale algorithm for hyperbolic multidimensional scaling that uses an "adiabatic" approximation from physics to optimize local positions while keeping cluster centroids fixed. MuH-MDS improves computing time by a factor of 10³ compared to previous methods and is able to handle large datasets comprising over 80,000 samples. We validate the method on a number of datasets, including a large-scale C. elegans embryogenesis scRNA-seq dataset with over 80,000 samples, where MuH-MDS uncovers intrinsic hierarchical structure and achieves improved pseudotime inference and lineage analysis compared to UMAP and other methods. Unlike UMAP and t-SNE, which emphasize local structure at the expense of global coherence and metric accuracy, MuH-MDS preserves global hierarchy in a metrically faithful manner, maintaining key relationships across scales.
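Hyperbolic MDS methods like MuH-MDS embed points so that pairwise distances match the geodesic metric of hyperbolic space, for example the Poincaré ball, whose distances grow rapidly toward the boundary and therefore suit tree-like hierarchies. A minimal sketch of that distance (not the MuH-MDS optimizer itself) follows; the example points are arbitrary.

```python
import numpy as np

def poincare_dist(u, v):
    """Geodesic distance in the Poincare ball, the metric that
    hyperbolic MDS tries to match. Distances blow up near the
    boundary, giving the exponentially expanding volume that suits
    tree-like hierarchies."""
    uu = 1.0 - np.dot(u, u)
    vv = 1.0 - np.dot(v, v)
    duv = np.dot(u - v, u - v)
    return float(np.arccosh(1.0 + 2.0 * duv / (uu * vv)))

origin = np.zeros(2)
near = np.array([0.10, 0.0])
deep = np.array([0.99, 0.0])  # near the boundary ~ deep in the hierarchy
# Euclidean distances are 0.1 and 0.99, but hyperbolically the
# boundary point is far away:
d_near = poincare_dist(origin, near)  # 2*artanh(0.10) ~ 0.20
d_deep = poincare_dist(origin, deep)  # 2*artanh(0.99) ~ 5.29
```

A full MDS then minimizes the stress between such geodesic distances and the input dissimilarities; the multiscale "adiabatic" trick in the abstract amounts to optimizing points within clusters while their centroids stay fixed.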
Truong, T. H.; Foster, E.; Fazio, T.; Holper, S.; Verspoor, K. M.
Information extraction (IE) from specialized clinical texts such as brain MRI reports is important for various clinical and population health contexts. However, this topic is under-explored due to privacy concerns limiting data availability and the inherent complexity and domain-specificity of clinical language; common methods that rely on substantial amounts of training data fail in this setting. Recent advances in large language model (LLM) research provide a promising solution to bridge the data scarcity gap, with improved ability to adapt to novel tasks with little supervision. We introduce a new, challenging dataset of 100 expert-annotated brain MRI reports, featuring 152 fine-grained entity types and 4 relation types, characterised by low inter-annotator agreement. This task reflects the inherent complexity and real-world ambiguity of medical text. We evaluate a small, open-weight LLM across span detection, named entity recognition, and relation extraction tasks. We compare few-shot prompting and parameter-efficient fine-tuning against specialized off-the-shelf biomedical IE systems. Our results demonstrate that both few-shot and fine-tuned LLM approaches substantially outperform off-the-shelf baselines. While LLMs show superiority, absolute performance, particularly for complex relations and fine-grained entities, remains modest, correlating with the dataset's inherent difficulty and the extreme low-resource setting.
Barba, T.; Bagley, B. A.; Steyaert, S.; Carrillo-Perez, F.; Sadee, C.; Iv, M.; Gevaert, O.
Magnetic resonance images (MRI) of the brain exhibit high dimensionality that poses significant challenges for computational analysis. While models proposed for brain MRI analysis yield encouraging results, the high complexity of neuroimaging data hinders generalizability and clinical application. We introduce DUNE, a neuroimaging-oriented encoder designed to extract deep features from multisequence brain MRIs, thereby enabling their processing by basic machine learning algorithms. A UNet-based autoencoder was trained using 3,814 selected scans of morphologically normal (healthy volunteers) or abnormal (glioma patients) brains to generate comprehensive low-dimensional representations of the full-sized images. To evaluate their quality, these embeddings were used to train machine learning models to predict a wide range of clinical variables. Embeddings were extracted for the cohorts used for model development (n=21,102 individuals), along with 3 additional independent cohorts (Alzheimer's disease, schizophrenia, and glioma cohorts; n=1,322 individuals), to evaluate the model's generalization capabilities. The embeddings extracted from healthy volunteers' scans could predict a broad spectrum of clinical parameters, including volumetry metrics, cardiovascular disease (AUROC=0.80) and alcohol consumption (AUROC=0.99), and more nuanced parameters such as the Alzheimer's-predisposing APOE4 allele (AUROC=0.67). Embeddings derived from the validation cohorts successfully predicted diagnoses of Alzheimer's dementia (AUROC=0.92) and schizophrenia (AUROC=0.64). Embeddings extracted from glioma scans successfully predicted survival (C-index=0.608) and IDH molecular status (AUROC=0.92), matching the performance of previous task-oriented models. DUNE efficiently represents clinically relevant patterns from full-size brain MRI scans across several disease areas, opening the way for innovative clinical applications in neurology.
One Sentence Summary: We propose a brain MRI-specialized encoder that extracts versatile low-dimensional embeddings from full-size scans.